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Abstract 

Multi-step ahead forecasting is still an open challenge in time series forecasting. Several approaches 
that deal with this complex problem have been proposed in the literature but an extensive compar- 
ison on a large number of tasks is still missing. This paper aims to fill this gap by reviewing existing 
strategies for multi-step ahead forecasting and comparing them in theoretical and practical terms. 
To attain such an objective, we performed a large scale comparison of these different strategies us- 
ing a large experimental benchmark (namely the 111 series from the NN5 forecasting competition). 
In addition, we considered the effects of deseasonalization, input variable selection, and forecast 
combination on these strategies and on multi-step ahead forecasting at large. The following three 
findings appear to be consistently supported by the experimental results: Multiple-Output strate- 
gies are the best performing approaches, deseasonalization leads to uniformly improved forecast 
accuracy, and input selection is more effective when performed in conjunction with deseasonaliza- 
tion. 

Keywords: Time series forecasting, Multi-step ahead forecasting, Long-term forecasting, 
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1. Introduction 



Time series forecasting is a growing field of interest playing an important role in nearly all fields 



of science and engineering, such as economics, finance, meteorology and telecommunication (Palit 



and Popovic 2005). Unlike one-step ahead forecasting, multi-step ahead forecasting tasks are more 



difficult (Tiao and Tsay 1994), since they have to deal with various additional complications, like 



accumulation of errors, reduced accuracy, and increased uncertainty (Weigend and Gershenfeld 



1994 Sorjamaa et al. 2007). 



The forecasting domain has been influenced, for a long time, by linear statistical methods such 
as ARIMA models. However, in the late 1970s and early 1980s, it became increasingly clear that 



linear models are not adapted to many real applications (Gooijer and Hyndman, 2006). In the same 



period, several useful nonlinear time series models were proposed such as the bilinear model (Poskitt 



and Tremayne, 1986), the threshold autoregressive model (Tong and Lim, 1980; Tong, 1983, 1990) 



and the autoregressive conditional heteroscedastic (ARCH) model (Engle, 1982| ) (see (Gooijer and 



Hyndman, 2006) and (Gooijer and Kumar 1992) for a review). Nowadays, Monte Carlo simulation 



or bootstrapping methods are used to compute nonlinear forecasts. Since no assumptions are made 



about the distribution of the error process, the latter approach is preferred (Clements et al. 2004 



Gooijer and Hyndman 2006). However, the study of nonlinear time series analysis and forecasting 



is still in its infancy compared to the development of linear time series (Gooijer and Hyndman 



2006). 



In the last two decades, machine learning models have drawn attention and have established 



themselves as serious contenders to classical statistical models in the forecasting community (Ahmed 



et al. 2010 Palit and Popovic, 2005 Zhang et al. 1998). These models, also called black-box or 



data-driven models (Mitchell, 1997), are examples of nonparametric nonlinear models which use 
only historical data to learn the stochastic dependency between the past and the future. For in- 
stance, Werbos found that Artificial Neural Networks (ANNs) outperforms the classical statistical 



methods such as linear regression and Box- Jenkins approaches (Werbos 1974, 1988). A similar 



study has been conducted by Lapedes and Farber (Lapedes and Farber, 1987) who conclude that 
ANNs can be successfully used for modeling and forecasting nonlinear time series. Later, others 
models appeared such as decision trees, support vector machines and nearest neighbor regres- 



sion (Hastie et al. 2009 Alpaydin, 2010). Moreover, the empirical accuracy of several machine 



learning models has been explored in a number of forecasting competitions under different data 
conditions (e.g. the NN3, NN5, and the annual ESTSP competitions (Crone, a|b Lendasse, 2007 



2008)) creating interesting scientific debates in the area of data mining and forecasting (Hand 



2008 Price, 2009 Crone 2009). 



In the forecasting community, researchers have paid attention to several aspects of the fore- 



casting procedure such as model selection (Aha, 1997 Curry and Morgan 2006 Anders and Korn 



1999 Chapelle and Vapnik, 2000), effect of deseasonalization (Hylleberg, 1992; Makridakis et al. 



1998 


Nelson et al. 


1999 


Zhang and Qi 


2005) 



W. J. 1969; Clemen, 1989 Timmermann, 2006) and many other critical topics (Gooijer and Hyn- 



dman, 2006). However, approaches for generating multi-step ahead forecasts for machine learning 



models did not receive as much attention, as pointed out by Kline: 11 One issue that has had limited 



investigation is how to generate multiple-step-ahead forecasts" (Kline, 2004). 

To the best of our knowledge, five alternatives (or strategies) have been proposed in the litera- 



ture to tackle an H-step ahead forecasting task. The Recursive strategy (Weigend and Gershenfeld 



1994; Sorjamaa et al. 


2007 


Cheng et al. , 


2006; 


Tiao and Tsay, 1994, 


Kline , 


2004; 


Hamzaebi et al. 



2009) iterates, H times, a one-step ahead forecasting model to obtain the H forecasts. After 



estimating the future series value, it is fed back as an input for the following forecast. 



In contrast to the previous strategy which use a single model, the Direct strategy (Weigend 



and Gershenfeld 


1994| Sorjamaa et al. 


2007 


Cheng et al. 


2006 


Tiao and Tsay, 1994; 


Kline , 


2004 



Hamzaebi et al. , 2009| ) estimates a set of H forecasting models, each returning a forecast for the 



i-th value (i 6 {1, . . . ,H}). 



A combination of the two previous strategies, called DirRec strategy has been proposed in (Sor- 



jamaa and Lendasse, 2006). The idea behind this strategy is to combine aspects from both, the 
Direct and the Recursive strategies. In other words, a different model is used at each step but the 
approximations from previous steps are introduced into the input set. 

In order to preserve, between the predicted values, the stochastic dependency characterizing 
the time series, the Multi-Input Multi-Output (MIMO) strategy has been introduced and analyzed 



in (Bontempi, 2008 Bontempi and Ben Taieb, 2011; Kline 2004). Unlike the previous strategies 
where the models return a scalar value, the MIMO strategy returns a vector of future values in a 
single step. 
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The last strategy, called DIRMO (Ben Taieb et al. 2009), aims to preserve the most appealing 
aspects of both the DIRect and miMO strategies. This strategy aims to find a trade-off between the 
property of preserving the stochastic dependency between the forecasted values and the flexibility 
of the modeling procedure. 

In the literature, these five forecasting strategies have been presented separately, sometimes, 
using different terminologies. The first contribution of this paper is to present a thorough unified 
review as well as a theoretical comparative analysis of the existing strategies for multi-step ahead 
forecasting. 

Despite the fact that many studies have compared between the different multi-step ahead 
approaches, the collective outcome of these studies regarding forecasting performance has been 
inconclusive. So the modeler is still left with little guidance as to which strategy to use. For exam- 



ple, research from (Bontempi et al. 1999 Weigend et al. 1992) provide experimental evidence in 



favor of Recursive strategy against Direct strategy. However, results from (Zhang and Hutchinson 



1994 Sorjamaa et al. 2007 Hamzaebi et al. 2009) support the fact that the Direct strategy is 



better than the Recursive strategy. The work by (Sorjamaa and Lendasse, 2006) shows that the 
DirRec strategy gives better performance than Direct and Recursive strategies. The Direct and 



Recursive strategies have been theoretically and empirically compared in (Atiya et al. 1999). In 
this study the authors obtained theoretical and experimental evidence in favor of Direct strategy. 



Concerning the MIMO strategy, Kline (Kline, 2004) and Cheng et al (Cheng et al. 2006) support 
the idea that MIMO strategy provides worse forecasting performance than Recursive and Direct 



strategies. However, in (Bontempi 2008 Bontempi and Ben Taieb, 2011), the comparison between 



MIMO, Recursive, and Direct was in favor of MIMO. Finally, ( |Ben Taieb et al.[ |2009H2010[ ) show 
that the DIRMO strategy gives better forecasting results than Direct and MIMO strategies when 
the parameter controlling the degree of dependency between forecasts is correctly identified. These 
previous comparisons have been performed with different datasets in different configurations us- 
ing different forecasting methods, such as Multiple Linear Regression, Artificial Neural Networks, 
Hidden Markov Models and Nearest Neighbors. 

All the contradictory findings of these studies make it all the more necessary to investigate 
further to find the truth concerning the relative performance of these strategies. The second con- 
tribution of this paper is an experimental comparison of the different multi-step ahead forecasting 
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strategies on the 111 time series of the NN5 international forecasting competition benchmark. 
These time series pose some of the realistic problems that one usually encounters in a typical 
multi-step ahead forecasting task, for example the existence of several times series of possibly re- 
lated dynamics, outliers, missing values, and multiple overlying seasonalities. This experimental 
comparison is performed for a variety of different configurations (regarding seasonality, input se- 
lection and combination), in order to have the comparison as encompassing as can be. In addition, 
the methodology used for this experimental comparison is based on the guidelines and recom- 



mendations advocated in some of the methodological papers (Demsar 2006 Garca and Herrera 



2009) 



In other words, the aim of this paper is not to make a comparison of machine learning algorithms 



for forecasting (which was already conducted in (Ahmed et al. , 2010)) but rather to show for a given 



learning algorithm, how the choice of the forecasting strategy can sensibly influence the performance 



of the multi-step ahead forecasts. In this work, we adopted the Lazy Learning algorithm (Aha 



1997), a particular instance of local learning, which has been successfully applied to many real- world 



forecasting tasks (Sauer, 1994 Bontempi et al. 1998, McNames 1998) 



Last but not least, the paper proposes also a Lazy Learning entry to the NN5 forecasting com- 



petition (Crone b ). The goal is to assess how this model fares compared to the other computational 



intelligence models that were proposed for the competition (Bontempi and Ben Taieb 2011). This 
will give us an idea about the potential of this approach. 

The paper is organized as follows. The next section presents a review of the different forecasting 
strategies. Section [3] describes the Lazy Learning model and the associated algorithms for the 
different forecasting strategies. Section [4] gives a detailed presentation of the datasets and the 
methodology applied for the experimental comparison. Section [5] presents the results and discusses 
them. Finally, Section [6] gives a summary and concludes the work. 

2. Strategies for Multi-Step- Ahead Time Series Forecasting 

A multi-step ahead (also called long-term) time series forecasting task consists of predicting the 
next H values [un+i, ■ ■ ■ , Un+h] of a historical time series [y%, . . . , i/n] composed of N observations, 
where H > 1 denotes the forecasting horizon. 

This section will first give a presentation of the five forecasting strategies and next, a subsection 
will be devoted to a comparative analysis of these strategies in terms of number and types of models 
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to learn as well as forecasting properties. 

We will use a common notation where / and F denote the functional dependency between past 



and future observations, d refers to the embedding dimension (Casdagli et al. 1991) of the time 



series, that is the number of past values used to predict future values and w represents the term 
that includes modeling error, disturbances and/or noise. 

2.1. Recursive strategy 

The oldest and most intuitive forecasting strategy is the Recursive (also called Iterated or 



Multi-Stage) strategy (Weigend and Gershenfeld 1994; Sorjamaa et al. 2007; |Cheng et al. 2006 



Tiao and Tsay, 1994 Kline 2004 Hamzaebi et al. 2009). In this strategy, a single model / is 



trained to perform a one-step ahead forecast, i.e. 



yt+i = f{vt,-- -,yt-d+i) + w, 



(1) 



with t € {d, . . . ,N- 1}. 

When forecasting H steps ahead, we first forecast the first step by applying the model. Sub- 
sequently, we use the value just forecasted as part of the input variables for forecasting the next 
step (using the same one-step ahead model). We continue in this manner until we have forecasted 
the entire horizon. 

Let the trained one-step ahead model be /. Then the forecasts are given by: 



if h = 1 
,VN-d+h) if he {2,...,d} 



f(VN, ■ • - ,2/iV-d+l) 

■ y ' v +''' ~ ^ fim+h-i, • • • , m+i,VN, 

KilN+h-i, • • • , VN+h-d) if h e {d+ 1, ... ,H} 

Depending on the noise present in the time series and the forecasting horizon, the recursive 
strategy may suffer from low performance in multi-step ahead forecasting tasks. Indeed, this is 
especially true if the forecasting horizon h exceeds the embedding dimension d, as at some point 
all the inputs are forecasted values instead of actual observations (Equation [2]) . The reason for the 
potential inaccuracy is that the Recursive strategy is sensitive to the accumulation of errors with 
the forecasting horizon. Errors present in intermediate forecasts will propagate forward as these 
forecasts are used to determine subsequent forecasts. 
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In spite of these limitations, the Recursive strategy has been successfully used to forecast 
many real- world time series by using different machine learning models, like recurrent neural net- 



works (Saad et al. 1998) and nearest-neighbors (McNames, 1998 Bontempi et al. 1999). 



2.2. Direct strategy 



The Direct (also called Independent) strategy (Weigend and Gershenfeld 1994 Sorjamaa et al. 



2007 Cheng et al. 2006| Tiao and Tsay 1994 Kline 2004 Hamzaebi et al. 2009) consists of fore 



casting each horizon independently from the others. In other terms, H models fh are learned (one 
for each horizon) from the time series [yi, ■ ■ ■ ,Vn] where 



Vt+h = fhiVt, Vt-d+l) + w, 
with t € {d, . . . , N - H} and h € {1, . . . , H}. 

The forecasts are obtained by using the H learned models fh as follows: 



(3) 



VN+h = fh(VN, 



,VN~d+l, 



(4) 



This implies that the Direct strategy does not use any approximated values to compute the 
forecasts (Equation [4]), being then immune to the accumulation of errors. However, the H models 
are learned independently inducing a conditional independence of the H forecasts. This affects the 
forecasting accuracy as it prevents the strategy from considering complex dependencies between 



the variables y^+h ( Bontempi 2008; Bontempi and Ben Taieb, 2011 Kline, 2004). For example 
consider a case where the best forecast is a linear or mildly nonlinear trend. The direct method 
could yield a broken curve because of the "uncooperative" way the H forecasts are generated. Also, 
this strategy demands a large computational time since there are as many models to learn as the 
size of the horizon. 

Different machine learning models have been used to implement the Direct strategy for multi- 



step ahead forecasting tasks, for instance neural networks (Kline 2004), nearest neighbors (Sorja- 



maa et al. 2007) and decision trees (Iran et al. 2009). 



2.3. DirRec strategy 



The DirRec strategy (Sorjamaa and Lendasse, 2006) combines the architectures and the prin 



ciples underlying the Direct and the Recursive strategies. DirRec computes the forecasts with 



different models for every horizon (like the Direct strategy) and, at each time step, it enlarges 
the set of inputs by adding variables corresponding to the forecasts of the previous step (like the 
Recursive strategy). However, note that unlike the two previous strategies, the embedding size d 
is not the same for all the horizons. In other terms, the DirRec strategy learns H models fh from 
the time series . . . , yjy] where 



Vt+h = fh(Vt+h-i, ■ ■ -,yt-d+i) + w, 



(5) 



with t G {d, . . . , N - H} and h G {1, . . . , H}. 

To obtain the forecasts, the H learned models are used as follows: 



UN+h 



(6) 



fh(yN,---,VN-d+i) iih = l 

h(jjN+h-i, ■ ■ -,yN+i,yN, ■ ■ -,yN-d+i) if he {2,..., H} 

This strategy outperformed the Direct and the Recursive strategies on two real-world time 



series: Santa Fe and Poland Electricity Load data sets (Sorjamaa and Lendasse 2006). Few 



research has been done regarding this strategy, so there is a need for further evaluation. 
2.4. MIMO strategy 

The three previous strategies (Recursive, Direct and DirRec) may be considered as Single- 



Output strategies (Ben Taieb et al. , 2010) since they model the data as a (multiple-input) single- 



output function (see Equations [2j [4] and [6]) . 

The introduction of the Multi-Input Multi-Output (MIMO) strategy (Bontempi, 2008; Bon- 
tempi and Ben Taieb, 2011) (also called Joint strategy ( |Kline 2004)) has been motivated by the 
need to avoid the modeling of single-output mapping, which neglects the existence of stochastic de- 



pendencies between future values and consequently affects the forecast accuracy (Bontempi, 2008 



Bontempi and Ben Taieb, 2011). 



The MIMO strategy learns one multiple-output model F from the time series [yi , . . . , yjy] where 



[yt+H, • • • , yt+i) = F(y t , . . . , yt-d+i) + w, (7) 
with t G {d, . . . ,N — H}, F : R d -> R H is a vector- valued function (|Micchelli and Pontil[ |2005[), 



and w G M. H is a noise vector with a covariance that is not necessarily diagonal (Matfas, 2005) . 
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The forecasts are returned in one step by a multiple-output model F where 



[y, 



t+H, 



, Vt+i] = F(y N , y N -d+i)- 



(8) 



The rationale of the MIMO strategy is to preserve, between the predicted values, the stochastic 
dependency characterizing the time series. This strategy avoids the conditional independence 
assumption made by the Direct strategy as well as the accumulation of errors from which plagues 
the Recursive strategy. So far, this strategy has been successfully applied to several real-world 



multi-step ahead time series forecasting tasks (Bontempi, 2008 Bontempi and Ben Taieb, 2011 



Ben Taieb et al. 2009 2010). 



However, the need to preserve the stochastic dependencies by using one model has a drawback 
as it constrains all the horizons to be forecasted with the same model structure. This constraint 



could reduce the flexibility of the forecasting approach (Ben Taieb et al. 2009). This was the 



motivation for the introduction of a new multiple-output strategy: DIRMO (Ben Taieb et al. 



2009, 2010), presented next. 



2.5. DIRMO strategy 



The DIRMO strategy ( Ben Taieb et al. 2009 ) aims to preserve the most appealing aspects of 
both the DIRect and miMO strategies. Taking a middle approach, DIRMO forecasts the horizon H 
in blocks, where each block is forecasted in a MIMO fashion. Thus, the -ff-step-ahead forecasting 
task is decomposed into n multiple-output forecasting tasks (n = — " ) , each with an output of size 

s (se{l,...,H}). 

When the value of the parameter s is 1, the number of forecasting tasks n is equal to H 
which correspond to the Direct strategy. When the value of the parameter s is H , the number of 
forecasting tasks n is equal to 1 which correspond to the MIMO strategy. There are intermediate 
configurations between these two extremes depending on the value of a parameter s. 

The tuning of the parameter s allows us to improve the flexibility of the MIMO strategy by 
calibrating the dimensionality of the outputs (no dependency in the case s = 1 and maximal 
dependency for s = H). This provides a beneficial trade off between the preserving a larger degree 
of the stochastic dependency between future values and having a greater flexibility of the predictor. 



The DIRMO strategy, previously called MISMO strategy (Ben Taieb et al. , 2009) (renamed for 
clarity reason) , learns n models F p from the time series [yi , . . . , y^] where 
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[yt+ P *s, yt+(p-i)* s +i] = F p(yt, yt-d+i) + w, (9) 

with t 6 {d, . . . , N — H}, p G {1, ... ,n} and F p : M d — > M s is a vector-valued function if s > 1. 
The -ff forecasts are returned by the n learned models as follows: 



[yN+p*s, ■■ ■ , yN+(p-x)*s+i] = F p(yN, ■ ■ ■ ,yjv-d+i)- (10) 

The DIRMO strategy has been successfully applied to two forecasting competitions: ESTSP'07 (Ben Taieb 



et al. 2009) and NN3 (Ben Taieb et al, 2010). 



2.6. Comparative Analysis 

To summarize, there are five possible forecasting strategies that perform a multi-step ahead 
forecasting task: Recursive, Direct, DirRec, MIMO and DIRMO strategies. Figure [T] shows the 
different forecasting strategies with links indicating their relationships. 



Recursive 





DIRMO 



Figure 1: The different forecasting strategies with the links showing their relationship. 

As we see, the DirRec strategy is a combination of the Direct and the Recursive strategy, while 
the DIRMO strategy is a combination of the Direct and the MIMO strategy. 

Contingent on the selected strategy, a different number and type of models will be required. 
Before presenting the general comparison of the multi-step ahead forecasting strategies, let us 
highlight using an example the differences between the forecasting strategies. 

Consider a multi-step ahead forecasting task for the time series [yi, . . . , yjsr] where the forecasting 
horizon H is 4. Table [T] shows, for each strategy, the different input sets and forecasting models 
involved in the calculation of the four forecasts [y~N+i, ■ ■ ■ , jjN+i]- 
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VN+l 


VN+2 


VN+3 


VN+4 


Recursive 


f(VNi—>VN-d+l) 


f(.VN+l,VN,—,VN-d+n) 


f(yN+2,yN+l,---,UN-d+3) 


f(yN+a,yN+2,---,yN-d+4) 


Direct 


fl(VN,—>VN-d+l) 




h{yN,—,DN-d+i) 


h{yN,—,VN-d+i) 


DirRec 


fl(VN:—>VN-d+l) 


h{i)N+i,yN,---,yN-d+i) 


h(yN+2,yN+i,---,yN-d+i) 


h{yN+3,VN+2,—,yN-d+i) 


MIMO 


F(VNt — ,VN-d+l) 


DIRMO = 2) 


FliVNi—tVN-d+l) 


FiiVNt—iVN-d+l) 



Table 1: The different forecasting models used by each strategy to obtain the 4 forecasts needed. 



Let Tso and Tmo denote the amount of computational time needed to learn (with a given 
learning algorithm) a Single-Output model and a Multiple-Output model, respectively. For a 
given iJ-step ahead forecasting task, we can see in Table [2] for each strategy the number and type 
of models to learn, the size of the output for each model as well as the computational time. 





Number of Models 


Types of models 


Size of output 


Computational time 


Recursive 


1 


SO 


1 


lxT so 


Direct 


H 


SO 


1 


HxTso 


DirRec 


H 


so 


1 


H x {Tso + H) 


MIMO 


1 


MO 


H 


1 x T MO 


DIRMO 


H 

s 


MO 


s 


f x T MO 



Table 2: For each forecasting strategy: the number and type of models (Single-Output or Multiple- 
Output) to learn and the size of the output for each model. 



Suppose Tmo = Tso + S, which is a reasonable assumption because learning a model with a 
vector- valued output takes more time than learning a model with a single- valued output. This 
allows us to rank the forecasting strategies according to their computation time for training given 
in Table [2] Indeed, we have 

1 x Tso < 1 x T MO < y x T MO < HxTso < Hx (T so + n\ , (11) 

Recursive MIMO ^ Direct DirRec 

DIRMO 

where we suppose that the parameter s of DIRMO is not equal to 1 or H. 

Note, in one hand, that the time needed to learn a SO model of the DirRec strategy equals 
Tso + A 4 because the input size of each SO task increases at each step. On the other hand, if we 
need to select the value of the parameter s by on some tuning, the DIRMO strategy will take more 
time and hence will be the slowest one. 
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In the following, we conclude this section by summarizing the pros and cons of the five fore- 
casting strategies as depicted on Table [3} 
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3. Lazy Learning for Time Series Forecasting 

Each of the forecasting strategies introduced in Section [2] demands the definition of a specific 
forecasting model or learning algorithm to estimate either the scalar-valued function / (see Equa- 
tions [TJ [3] and [5]) or the vector- valued function F (see Equations [7] and [9]) which represent the 
temporal stochastic dependencies. As the goal of the paper is not to compare forecasting mod- 
els (as in (Ahmed et al. 2010| )) but rather multi-step ahead forecasting strategies, the choice of a 
underlying forecasting model is required to setup the experiments. In this paper, we adopted the 
Lazy Learning algorithm, which is a particular instance of local learning models, since it has been 



shown to be particularly effective in time series forecasting tasks (Bontempi et al. , 1998; Bontempi 



1999, 2008 Bontempi and Ben Taieb, 2011 Ben Taieb et al., 2009, 2010) 



Next section gives a general comparison of global models with local models. Section 3.2 presents 
the Lazy Learning Algorithm in terms of learning properties. Section [3 . 3| and [3~4] describe two Lazy 
Learning algorithms for two types of learning tasks, namely the Single-Output and Multiple- Output 
Lazy Learning algorithms. Finally, the a discussion is presented on the model combination. 

3.1. Global vs local modeling for supervised learning 

Forecasting the future values of a time series using past observations can be reduced to a 
supervised learning problem or, more precisely, to a regression problem. Indeed, the time series 
can be seen as a dataset made of pairs where the first component, called input, is a past temporal 
pattern and the second, called output, is the corresponding future pattern. Being able to predict 
the unknown output for a given input is equivalent to forecasting the future values given the last 
observations of the time series. 

Global modeling is the typical approach to the supervised learning problem. Global models 
are parametric models that describe the relationship between the inputs and the output values 
as an analytical function over the whole input domain. Examples of global models are linear 



models (Montgomery et al. 2006), nonlinear statistical regressions (Seber and Wild, 1989) and 



Neural Networks (Rumelhart et al. , 1986). 



Another approach is the divide- and- conquer modeling which consists in relaxing the global 
modeling assumptions by dividing a complex problem into simpler problems, whose solutions can 



be combined to yield a solution to the original problem (Bontempi, 1999). The divide-and-conquer 
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has evolved in two different paradigms: the modular architectures and the local modeling ap- 



proach (Bontempi, 1999). 



Modular techniques replace a global model with different modules covering different parts of the 



input space. Examples based on this approach are Fuzzy Inference Systems (Takagi and Sugeno 



1985), Radial Basis Functions (Moody and Darken, 1989; Poggio and Girosi, 1990), Local Model 



Networks (Murray-Smith, 1994), Trees (Breiman et al. 1984) and Mixture of Experts (Jordan and 



Jacobs, 1994). The modular approach is in the intermediate scale between the two extremes, the 



global and the local approach. However, their identification is still performed on the basis of the 
whole dataset and requires the same procedures used for generic global models. 

Local modeling techniques are at the extreme end of divide-and-conquer methods. They are 
nonparametric models that combine excellent theoretical properties with a simple and flexible 
learning procedure. Indeed, they do not aim to return a complete description of the input/output 
mapping but rather to approximate the function in a neighborhood of the point to be predicted (also 
called the query point). There are different examples of local models, for example nearest neighbor, 



weighted average, and locally weighted regression (Atkeson et al. , 1997b). Each of these models use 



data points near the point to be predicted for estimating the unknown output. Nearest neighbor 
models simply find the closest point and uses its output value. Weighted average models combines 
the closest points by averaging them with weights inversely proportional to their distance to the 
point to be predicted. Locally weighted regression models fit a model to nearby points with a 
weighted regression where the weights are function of distances to the query point. 

The effectiveness of local models is well-known in the time series and computational intelligence 



community. For example, the method proposed by Sauer (Sauer, 1994) gave good performance and 



ranked second best forecast for the Santa Fe A dataset from a forecasting competition organized 
by Santa Fe institute. Moreover, the two top-ranked entries of the KULeuven competition used 



local learning methods (Bontempi et al. 1998 McNames 1998) 



In this work, we will restrict to consider a particular instance of local modeling algorithms: the 



Lazy Learning algorithm (Aha, 1997) 



3.2. The Lazy Learning algorithm 

It is possible to encounter different degree of "laziness" in local learning algorithms. For in- 
stance, a k Nearest Neighbor (fc-NN) algorithm, which learns the best value of k before the query is 
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requested, is hardly a lazy approach since, after the query is presented, it requires only a reduced 
amount of learning procedure, only the computation of the neighbors and the average. On the 
contrary, a local method, which depends on the query to select the number of neighbors or other 
structural parameters presents a higher degree of "laziness" . 



The Lazy Learning (LL) algorithm, extensively discussed in (Birattari et al. 1999 Birattari 



and Bersini, 1997), is a query-based local modeling technique where the whole learning procedure 



is deferred until a forecast is required. When the query is requested, the learning procedure may 
start to select the best value of the number of neighbors (or other structural parameters) and next, 
the dataset is searched for the nearest neighbors of the query point. The nearest neighbors are 
then used for estimating a local model, which returns a forecast. The local model is then discarded 
and the procedure is repeated from scratch for subsequent queries. 



The LL algorithm has a number of attractive features (Aha 1997), namely, the reduced number 
of assumptions, the online learning capability and the capacity to model nonstationarity. LL 
assumes no a priori knowledge on the process underlying the data, which is particularly relevant in 
real datasets. These considerations motivate the adoption of the LL algorithm as a learning model 
in a multi-step ahead forecasting context. 

Local modeling techniques require the definition of a set of model parameters namely the num- 
ber k of neighbors, the kernel function, the parametric family and the distance mefnc[REF]. In 
the literature, different methods have been proposed to select automatically the adequate config- 
uration (Atkeson et al. 1997b Birattari et al. 1999| ). However, in this paper, we will limit the 
search on only the selection of the number of neighbors (also called or equivalent to the bandwidth 
selection). This is essentially the most critical parameter, as it controls the bias/ variance trade-off. 



Bandwidth selection is usually performed by rule-of-thumb techniques (Fan and Gijbels, 1995), 



plug-in methods (Ruppert et al. 1995) or cross-validation strategies (Atkeson et al. 1997a). Con- 



cerning the other parameters, we use the tricubic kernel (Cleveland et al. , 1988) as kernel function, 
a constant model for the parametric family and the euclidean distance as metric. 

Note that in order to apply local learning to a time series, we need to embed it into a dataset 
made of pairs where the first component is a temporal pattern of length d and the second component 
is either the future value (in the case of Single-Output Modeling) or the consecutive temporal 
pattern of length H (in the case of Multiple-Output Modeling). In the following sections, D will 
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refer to the embedded time series with M input /output pairs. 
3.3. Single- Output Lazy Learning algorithm 

In the case of Single-Output learning (i.e with scalar output), the Lazy Learning procedure 
consists of a sequence of steps detailed in Algorithm [TJ The algorithm assesses the generalization 
performance of different local models and compares them in order to select the best one in terms of 
generalization capability. To perform that, the algorithm associate a Leave-One-Out (LOO) error 
e LOo(k) to the estimation y^ obtained with k neighbors (lines [4] and ^ . 

The LOO error can provide a reliable estimate of the generalization capability. However the 
disadvantage of such an approach is that it requires to repeat k times the training process, which 
means a large computational effort. Fortunately, in the case of linear models there exists a powerful 
statistical procedure to compute the LOO cross-validation measure at a reduced computational 



cost: the PRESS (Prediction Sum of Squares) statistic (Allen, 1974). 

In case of constant model, the LOO error eLoo{k) for the estimation of the query point 
x g is calculated as follows (Bontempi, 1999): 



e£Oo(*) = r£(«*(*)) S 



(12) 



where ej(k) designates the error obtained by setting aside the j-th neighbor of x q (j G {1, . . . , k}). 
If we define the output of the k closest neighbors of x 9 as {y\i], • • • , yny} then, ej(k) is defined as 



ej(k) = y {j] 



(* - l )v\j] - y® 



k-l 



(k - l)y {j] + y m - y {j] - f M 



k-l 



%] ~ Ei=i V[i 
k-l 



k 



y\j\ - yj 
k-i 



(AO 



(13) 
(14) 
(15) 
(16) 

(17) 



Note that if we use Equation 13 to calculate the LOO error (Equation 12), the training process 



is repeated k times since the sum in Equation 13 is performed for each index j. However, by 
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using the PRESS statistic (Equation 17), we avoid this large computational effort since the sum is 



(k) 

replaced by the previously computed y q which was already calculated. This makes the PRESS 
statistic an efficient method to compute the LOO error. 

After evaluating the performance of local models with different number of neighbors k (lines [3] 
to [6]), the best one which minimizes the LOO error (having index k*) is selected (lines [7] and [8]) . 
Finally, the prediction of the output of x q is returned (line [9]) . 



Algorithm 1: Single-Output Lazy Learning 



Input 
Input 
Input 
Output 



D = {{Xi,yi) G (M d x R) with % e {1, . . . ,M}}, dataset. 
x g 6 query point. 



Kmax, the maximum number of neighbors. 
y q , the prediction of the (scalar) output of the query point x g . 

1 Sort increasingly the set of vectors {xj} with respect to the distance to x ? . 

2 [j] designate the index of the jth closest neighbor of x q . 

3 for k G {2, . . . , Kmax} do 



4 
5 

6 end 



W 1 v^fc 

Calculate eioo{k) which is defined in Equation 
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7 k = arg min klE {2,...,Kmax} e L oo(k). 
s y q = y q ■ 

9 return y q . 



3.4- Multiple- Output Lazy Learning algorithm 

The adoption of Multiple-Output strategies requires the design of multiple-output (or equiva- 



lently multi-response) modeling techniques (Matfas 2005 Breiman and Friedman 1997 Micchelli 



and Pontil, |2005 ) where the output is no more a scalar quantity but a vector of values. Like in 



the Single-Output case, we need criteria to assess and compare local models with different number 
of neighbors. In the following, we present two criteria: the first one is an extension of the LOO 
error for the Multiple-Output case (Algorithm [2]) (Bontempi, 2008 Ben Taieb et al. 2010) and the 
second one is a criterion proper to the Multiple-Output modeling (Algorithm [3]) (Ben Taieb et al. 



2010 Bontempi and Ben Taieb, 2011). Note that, in the two algorithms, the output is a vector of 



size / (e.g. I will equal H with the MIMO strategy or s in the DIRMO strategy). 

Algorithm [2] is an extension of the Algorithm [T] for vectorial outputs. We still use the LOO 
cross-validation measure as a criterion to estimate the generalization capability of the model but 
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Algorithm 2: Multiple-Output Lazy Learning (LOO criterion): MIMO-LOO 



Input 
Input 
Input 
Output 



D = {(xj, yi) G (R d x R l ) with i G {1, . . . , M}}, dataset. 



Xq G query point. 



Kmax, the maximum number of neighbors. 

y q , the prediction of the (vectorial) output of the query point x ? . 

1 Sort increasingly the set of vectors {xj} with respect to the distance to x g . 

2 [j] will designate the index of the jth closest neighbor of x q . 

3 for k G {2, . . . , Kmax} do 



(fc) 1 TT^k 
yg =k2Z j= iy\j]- 



=i e LOo(k) where e l LOO (k) is defined in Equation 

6 end 

7 k* = arg min fce{ 2,... 1 Kma a: } E LO o(k). 

(k*) 

s y<? = yg 

9 return y g . 



12 



here, the LOO error is an aggregation of the errors obtained for each output (line [5]). Note that 
the same number of neighbors is selected for all the outputs (e.g. MIMO strategy) unlike what 
could happen with different Single-Output tasks (e.g. Direct strategy). 
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The second criterion uses the fact that the forecasting horizon H is supposed to be large (multi- 
step ahead forecasting) and hence we have enough samples to estimate some descriptive statistics. 
Then, instead of using the Leave-One-Out error, we can use as criterion a measure of stochastic 
discrepancy between the forecasted values and the training time series. The lower the discrepancy 
between the descriptors of the forecasts and the training time series, the better is the quality of 



the forecasts (Bontempi and Ben Taieb, 2011). 

Several measures of discrepancy can be defined, both linear and non-linear. For example, the 
autocorrelation can be used as linear statistics and the maximum likelihood as a non-linear one. In 
this work, we will consider only one linear measure using both the autocorrelation and the partial 
correlation. 

(k) 

The assessement of the quality of the estimation y q of the query point x g is calculated as 
follows 

E A (k) = 1 - \cor[p(ts ■ yW),p(ts)]\ + 1 - \cor[7r(ts ■ yf), 7r(is)]| , (18) 

where the symbol "•" represents the concatenation, ts represent the training time series and cor is 
the Pearson correlation. This discrepancy measure is composed of two parts where the first part 
uses the autocorrelation (noted p) and the second uses the partial autocorrelation (noted ir). 

For each part, we calculate the discrepancy (estimated with the correlation, noted cor) between, 
on one hand, the autocorrelation (or partial autocorrelation) of the concatenation of the training 

( k) 

time series ts and the forecasted sequence and, on the other hand, the autocorrelation (or 



partial autocorrelation) of the training time series ts (Bontempi, 2008 Ben Taieb et al. 2009). 

In Algorithm [3j after evaluating the performance of local models with different number of 
neighbors k (lines [3] to [6]), the best one, which minimizes the discrepancy between the forecasting 
sequence and the training time series (having index &:*), is selected (lines [7] and [8]). In other words, 
the goal is to select the best number of neighbors k* which preserve the stochastic properties of 
the time series in the forecasted sequence. Finally, the prediction of the output of x g is returned 
(line[9]). 
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Algorithm 3: Multiple-Output Lazy Learning (discreprancy criterion): MIMO-ACFLIN 
Input : ts = [tsi, . . . , £sjv]> time series. 

Input : D = {(xi,yj) G (R d x M z ) with i G {1, . . .,M}}, dataset. 
Input : Xg G M d , query point. 

Input : Kmax, the maximum number of neighbors. 

Output: y q , the prediction of the (vectorial) output of the query point x ? . 

1 Sort increasingly the set of vectors {xj} with respect to the distance to x g . 

2 [j] will designate the index of the jth closest neighbor of x q . 

3 for k G {2, . . . , Kmax} do 



(*0 l v^fe 
y? =k22 j= iy\j]- 



Calculate E^{k) which is defined in Equation 

6 end 

7 k* = arg mm ke{2 ,...,Kmax} E A (k). 
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(**) 

s y 9 = y q 



9 return y g . 
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3.5. Model selection or model averaging 

Considering the Algorithm [TJ we can see that we generate, for the query x q , a set of predic- 
tions {y q 2 \ y^q \ • • • , Hg Kmax ^}, each obtained with different number of neighbors. For each of these 
predictions, a testing error {eLoo(2), ez,oo(3), • • • , eLOo(Kmax)} has been calculated. Note that 
the next considerations are also applicable to Algorithms [2] and [3j 

The goal of model selection is to use all this information (set of predictions and testing errors) 
to estimate the final prediction y q of the query point x q . There exist two main paradigms mainly 
the winner-take- all and combination approaches. 

In the Algorithm [TJ we presented the winner-take-all approach (noted WINNER) ( |Maron and 



Moore 



- • (k) 

1997) which consists of comparing the set of models y\ and selecting the best one in terms 



of testing error eLoo(k) (see line[7|). 

Selecting the best model according to the testing error is intuitively the approach which should 
work the best. However, results in machine learning show that the performance of the final model 



can be improved by combining models having different structures (Raudys and Zliobaite, 2006 



Jacobs et al. 


1991 


Breiman 


1996; 



In order to apply the model averaging, lines 7 and 8 of the Algorithm [TJ can be replaced by 



(2) , 

P2 y q H vpk 



(Krnax) 
mux yq 



Lfc=2 Pk 



(19) 



where an average is calculated. The weights pk will take different values depending on the combi- 
nation approach adopted. If pk equals K ^ rmx , we are in the case of equally weighted combination 
and y q reduces to an arithmetic mean (noted COMB). Otherwise, if weights are assigned according 
to testing errors, pk will equal eL< ^ ^) an d 2/<j reduces to a weighted mean (noted WCOMB). 
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4. Experimental Setup 

4-1. Time Series Data 

In the last decade, several time series forecasting competitions (e.g. the NN3, NN5, and the 
ESTSP competitions (Crone ajb Lendasse, 2007 2008)) have been organized in order to compare 
and evaluate the performance of computational intelligence methods. Among them, the NN5 
competition ( Crone , b]) is one of the most interesting one since it includes the challenges of a real- 
world multi-step ahead forecasting task, namely multiple time series, outliers, missing values as 
well as multiple overlying seasonalities, etc. Figure [2] shows four time series from the NN5 dataset. 

Each of the 111 time series of this competition represents roughly two years of daily cash money 
withdrawal amounts (735 data points) at ATM machines at one of the various cities in the UK. 
For each time series, the competition required to forecast the values of the next 56 days, using the 
given historical data points, as accurately as possible. The performance of the forecasting methods 
over one time series was assessed by the symmetric mean absolute percentage of error (SMAPE) 
measure (Crone b|, defined as 



SMAPE 



1 H 



\Uh - Vh\ 



x 100, 



(20) 



H ti ^ h + Vh)/2 

where yt is the target output and yh is the prediction. Since this is a relative error measure, the 
errors can be averaged over all time series to obtain a mean SMAPE defined as 



111 

SMAPE* = V SMAPE; 

111 ^ 

i=l 



(21) 



where SMAPE,- denotes the SMAPE of the ith time series. 
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4-2. Methodology 

The aim of the experimental study is to compare the accuracy of the five forecasting strategies 
in the context of the NN5 competition. Since the accuracy of a forecasting technique is known to 
be dependent on several design choices (e.g. the deseasonalization or the input selection) and we 
want to focus our analysis on the multi-step ahead forecasting strategies, we consider a number 
of different configurations in order to increase the statistical power of our comparison. Every 
configuration is composed of several preprocessing steps as sketched in Figure [3| Since some of 
these steps (e.g. deseasonalization) can be performed in alternative ways (e.g. two alternatives 
for the deseasonalization, two alternatives for input selection, three alternatives for the model 
selection), we come up with 12 configurations. The details about each step are given in what 
follows. 




1 Gaps removal 



2. Deseasonalization : Yes or No 



i 



3. Embedding Dimension Selection 



4. Input Selection : Yes or No 



Model Selection : 
WINNER or COMB or WCOMB 



Figure 3: The Different Preprocessing Steps. 



Step 1: Gaps removal 

The specificity of the NN5 series requires a preprocessing step called gaps removal where by 
gap we mean two types of anomalies: (i) zero values that indicate that no money withdrawal 
occurred and (ii) missing observations for which no value was recorded. About 2.5% of the 
data are corrupted by gaps. In our experiments we adopted the gap removal method proposed 



in (Wichard, 2010): if y m is the gap sample, this method replaces the gap with the median of the 
set [y m +365,ym-365,ym+7 and y m -i\ among which are available. 
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Step 2: Deseasonalization The adoption of deseasonalization may have a large impact on the 
forecasting strategies because the NN5 time series possess a variety of periodic patterns. For that 
reason we decided to consider tasks with and without deseasonalization in order to better account 
for the role of the forecasting strategy. We adopt the deseasonalization methodology discussed in 



(Andrawis et al. 2011) to remove the strong day of the week seasonality as well as the moderate 
day of the month seasonality. Of course after we deseasonalize and apply the forecasting model we 
restore back the seasonality. 
Step 3: Embedding dimension selection 

Every forecasting strategy requires the setting of the size d of the embedding dimension (see 
Equations[T]to[9]) . Several approaches have been proposed in the literature to select this value ( Kants 



and Schreiber 2004). Since this aspect is not a central theme in our paper we just applied the 



state-of-the-art approach reviewed in ( Crone and Kourentzes 2009 ) , which consists of selecting the 
time-lagged realizations with significant partial correlation function (PACF). This method allows 
to select the value of the embedding dimension and then to identify the relevant variables within 
the window of past observations. We set the maximum lag of the PACF to 200 to provide a suf- 
ficiently comprehensive pool of features. However, note that the final dimensionality of the input 
vectors for all the time series is on average equal to 24. 
Step 4: Input Selection 

We considered the forecasting task with and without input variable selection step. A variable 
selection procedure requires the setting of two elements: the relevance criterion, i.e. statistics which 
estimates the quality of the selected variables, and the search procedure, which describes the policy 
to explore the input space. We adopted the Delta test(DT) as relevance criterion. The DT has 



been introduced in time series forecasting domain by Pi and Peterson in (Pi and Peterson, 1994) 



and later successfully applied to several forecasting task (Ben Taieb et al. 2009, 2010; Liitiainen 



and Lendasse, 2007 Guillen et al. , 2008; Mateo and and, 2010). This criterion is based on applying 



some kind of a noise variance estimator, and then selecting the set of input variables that yield the 



strongest and most deterministic dependence between inputs and outputs (Mateo and Lendasse 



2008). 



Concerning the search procedure, we adopted a Forward-Backward Search (FBS) procedure 
which is a combination of forward selection (sequentially adding input variables) and backward 
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search (sequentially removing some input variables). This choice was motivated by the flexibility 
of the FBS procedure which allows a deeper exploration of the input space. Note that the search 
is initialized by the set of variables defined in the previous step. 
Step 5: Model Selection 



Concerning the model selection procedure, three approaches (see Section 3.5) are taken into 
consideration in our experiments: 

WINNER : This approach selects the model that gives best performance for the test set (winner- 
take-all approach). 

COMB : This approach combines all alternative models by simple averaging. 

WCOMB : This approach combines models by weighted averaging where weights are inversely 
proportional to the test errors. 

4-2.1. The Compared forecasting strategies 

Table [4] presents the eight forecasting strategies that we tested, showing also their respective 
acronyms. 
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1. REC 


The Recursive forecasting strategy. 


2. DIR 


The Direct forecasting strategy. 


3. DIRREC 


The DirRec forecasting strategy. 


MIMO 


4. MIMO-LOO 


A variant of the MIMO forecasting 
strategy with the LOO selection 
criteria. 


5. MIMO-ACFLIN 


A variant of the MIMO forecasting 
strategy with the autocorrelation 
selection criteria. 


DIRMO 


6. DIRMO-SEL 


The DIRMO forecasting strategy 
which select the best value of the 
parameter s. 


7. DIRMO- AVG 


A variant of the DIRMO strategy 
which calculates a simple average 
of the forecasts obtained with 
different values of the parameter s. 


8. DIRMO- WAVG 


The DIRMO- AVG with a weighted 
average where weights are inversely 
proportional to testing errors. 



Table 4: The five forecasting strategies with their respective variants which gives eight forecasting 
strategies. 
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4-2.2. Forecasting performance evaluation 

This section describes the assessment procedure (Figure [3]) of the 8 forecasting strategies. 



r \ 






Measure of average 




Test for significant 


forecasting errors on 




differences in average 


all NN5 time series 




forecasting ranks 


using 




using 


SMAPE* 




Friedman test 






) 



Identification of groups 
which significantly 
differ from others 

using 

Post-hoc test 



Figure 4: Different steps of the forecasting performance evaluation. 

The procedure for comparing between the eight forecasting strategies is shown in Figure |4j The 
accuracy of each forecasting strategy is first measured using the SMAPE* measure calculated over 



the 111 time series and defined in Equation 21 To test if there are significant general differences 
in performance between the different strategies, we have to consider the problem of comparing 



multiple models on multiple data sets. For such case Demsar (Demsar 2006 Garca and Herrera 



2009) in a detailed comparative study recommended using a two stage procedure: first to apply 
Friedman's or Iman and Davenport's tests to test if the compared models have the same mean 
rank. If this test rejects the null-hypothesis, then post-hoc pairwise tests are to be performed to 
compare the different models. These tests adjust the critical values higher to ensure that there is 
at most a 5% chance that one of the pairwise differences will be erroneously found significant. 
Friedman test 



The Friedman test (Friedman, 1937, 1940) is a non-parametric procedure which tests the sig- 
nificance of differences between multiple ranks. It ranks the algorithms for each dataset separately: 
the rank of 1 will be given to the best performing algorithm, the rank of 2 to the second best and 
so on. Note that average ranks are assigned in case of ties. 

After ranking the algorithms for each dataset, the Friedman test compares the average ranks of 
algorithms. Let r % - be the rank of the j-th of k algorithms on the i-th of N data sets, the average 
rank of the j-th. algorithm is Rj = ^ i rj- . 

The null-hypothesis states that all the algorithms are equivalent and so their ranks Rj should 
be equal. Under the null-hypothesis, the Friedman statistic 
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Q 



12N 

k(k + l) 



k(k + l) 



(22) 



is distributed according to a chi-squared with k — 1 degrees of freedom (x?_i) 5 when ./V and A; are 



large enough (as a rule of a thumb, iV > 10 and k > 5) (Demsar 2006) 



Iman and Danvenport (Iman and Davenport 1980), showing that Friedman's statistic is unde- 



sirably conservative, derived another improved statistic, given by 



S 



(23) 



(N-1)Q 
N(k - 1) - Q 

which is distributed, under the null- hypothesis, according to the F-distribution with k — 1 and 
(k — l)(iV — 1) degrees of freedom. 
Post-hoc test 

When the null-hypothesis is rejected, i.e. there is a significant difference between at least 2 
strategies, a post-hoc test is performed to identify significant pairwise differences among all the 
algorithms. The test statistic for comparing the i-th and the j-th algorithm is 



(Ri — Rj) 



fc(fc+i) 

6N 



(24) 



which is asymptotically normally distributed under the null hypothesis. After the corresponding 
p-value is calculated, it is compared with a given level of significance a. 

However, in multiple comparisons, as there are a possibly large number of pairwise comparisons, 
there is a relatively high chance that some pairwise test are incorrectly rejected. Several procedures 
exist to adjust the value of a to compensate for this bias, for instance Nemenyi, Holm, Shaffer as well 



as Bergmann and Hommel (Demsar 2006). Based on the suggestion of Garcia and Herrera (Garca 



and Herrera 2009) we adopted Shaffer's correction. The reason is that Garcia and Herrera (Garca 



and Herrera, 2009) showed that Shaffer's procedure has the same complexity Holm's procedure, 



but with the advantage of using information about logically related hypothesis. 
4-3. Experimental phases 

In order to reproduce the same context of the NN5 forecasting competition the experimental 
setting is made of two phases: the pre- competition and the competition phase. 
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4-3.1. Pre- competition phase 

The pre-competition phase is devoted to the comparison of the different forecasting strategies 
using the available observations of 111 time series. The goal is to learn the different parameters 
and then estimate the forecasting performance and compare between the different strategies. 

To estimate the forecasting performance of each strategy, we used a learning scheme with 
training-validation-testing sets. Each time series (containing 735 observations) is partitioned in 
three mutually exclusive sets (A, B and C) as shown in Figure [5] training (Day 1 to Day 623: 623 
values), validation (Day 624 to Day 679: 56 values) and testing (Day 680 to Day 735: 56 values). 

The validation set (B in Figure [5]) is used to build and tune the models. Specifically, as 
we use a Lazy Learning approach, we need to select, for each model, the range of number of 
neighbors([2, . . . , Kmax]) to use in performing the forecasting task. 

The test set (C in Figure [5]) is used to measure the performances of each forecasting strategy. 
To make the utmost use of the available data, we adopt a multiple time origin test as suggested by 



Tashman in (Tashman, 2000), where the time origin denotes the point from which the multi-step 
ahead forecasts are generated. 

The time origin and corresponding forecast intervals are given as: 

1. Day 680 to Day 735 (56 data points) 

2. Day 687 to Day 735 (49 data points) 

3. Day 694 to Day 735 (42 data points) 

In other words, we perform the forecast three times starting from the three different starting 
points, each time forecasting a number of steps ahead till the end of the interval. Note that we 
used the same test period and evaluation criterion (i.e. the SMAPE) as used by Andrawis et 



al in (Andrawis et al. 2011). This allows us to compare our results with several other machine 
learning models tested in this article. 

4-3.2. Competition phase 

In the competition phase we generate the final forecasts, made up of 56 future observations, 
which would have been submitted to the competition. This phase takes advantage of the lessons 
learned and the design choices made in the pre-competition phase. Here, we combine the training 
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Figure 5: Learning with three mutually exclusive sets for training (A), validation (B) and test- 
ing(C). 

set with the test set (A+B in Figure [6]) to retrain the models of the different strategies and then 
generate the final forecast (which will be submitted to the competition). The training set (A+B 
on Figure [6]) is now made of 679 data points and the validation set (C on Figure [6| is composed 
of the next 56 data points, as shown in Figure [6| In other words, the 735 values are then used to 
build and tune the models, which will next return the forecasted values. 
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Time 



Figure 6: Forecasting with two mutually exclusive sets for training (A+B) and validation (C). 
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5. Results and discussion 



This section presents and discusses the prediction results of the forecasting strategies for the 
pre- competition and competition phase. For each phase, we report the results obtained in the 12 



different configurations introduced in Section 4.2 



The forecasting performance of the strategies is measured using the criteria discussed in Sec- 



tion 4.2.2 and presented by means of two tables. The first one provides the average SMAPE as well 
as the ranking for each forecasting strategy. Since the null-hypothesis stating that all the algorithms 
are equivalent has been rejected (using the Iman-Davenport statistic) for all the configurations, 
we proceeded with the post-hoc test. The second table presents the results of this post-hoc test, 
which partitions the set of strategies in several groups which are statistically signicantly different 
in terms of forecasting performance. 

Note that the configurations which require the input selection do not contain the DIRMO 
results since combining the selection of the inputs with the selection of the parameter s would have 
needed an excessive amount of computation time. 

5.1. Pre- competition results 

The SMAPE and ranking results for the pre- competition phase are presented in Table [5] while 
the results of the post-hoc test are summarized in Table [6j 
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Deseasona 


ization : No - Input Selection : No 


Strategies 


WINNER 


COMB 


COMBW 




SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


DIR. 


22.37[0.55](7) 


5.19(7) 


22.19[0.54](7) 


4.96(6) 


22.61[0.55](5) 


5.47(7) 


REC 


21.41[0.59](6) 


3.98(5) 


21.95[0.58](5) 


4.63(5) 


22.91[0.62](7) 


4.73(5) 


DIRREC 


45.25[0.89](8) 


8.00(8) 


40.73[0.83](8) 


8.00(8) 


38.93[0.77](8) 


7.99(8) 


MIMO-LOO 


21.17[0.60](2) 


3.64(2) 


20.61[0.57](1) 


2.77(1) 


20.61[0.58](1) 


2.71(1) 


MIMO-ACFLIN 


21.40[0.59](5) 


4.48(6) 


20.69[0.58](2) 


3.17(2) 


20.61[0.58](2) 


2.71(2) 


DIRMO-SEL 


21.21[0.56](3) 


3.76(3) 


22.15[0.56](6) 


5.40(7) 


22.68[0.56](6) 


5.32(6) 


DIRMO-AVG 


21.27[0.56](4) 


3.88(4) 


20.90[0.56](3) 


3.36(3) 


21.16[0.56](3) 


3.42(3) 


DIRMO-WAVG 


21.12[0.56](1) 


3.07(1) 


20.96[0.56](4) 


3.71(4) 


21.25[0.57](4) 


3.65(4) 



(a) 





Deseasona 


ization : No - Input Selection : Yes 


Strategies 


WINNER 


COMB 


COMBW 




SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


DIR 


22.80[0.50](4) 


3.20(4) 


22.73[0.52](4) 


3.27(4) 


23.37[0.54](4) 


3.27(4) 


REC 


21.17[0.55](2) 


2.20(2) 


22.21[0.52](3) 


2.96(3) 


23.20[0.56](3) 


3.01(3) 


DIRREC 


45.21[0.93](5) 


5.00(5) 


40.57[0.83](5) 


4.96(5) 


39.03[0.75](5) 


4.96(5) 


MIMO-LOO 


21.O6[0.57](l) 


2.02(1) 


20.60[0.59](1) 


1.79(1) 


20.64[0.59](1) 


1.88(1) 


MIMO-ACFLIN 


21.40[0.57](3) 


2.58(3) 


20.65[0.59](2) 


2.01(2) 


20.64[0.59](2) 


1.88(2) 



(b) 





Deseasona 


ization : Yes - Input Selection : No 


Strategies 


WINNER 


COMB 


COMBW 




SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


DIR 


22.29[0.57](7) 


6.42(8) 


20.41[0.57](7) 


6.23(8) 


20.49[0.58](7) 


6.43(8) 


REC 


21.61[0.61](6) 


5.35(6) 


19.39[0.59](6) 


3.87(4) 


19.31[0.59](5) 


3.68(2) 


DIRREC 


22.61 [0.68] (8) 


6.33(7) 


20.67[0.64](8) 


6.15(7) 


20.61[0.61](8) 


6.04(7) 


MIMO-LOO 


19.81 [0.59] (4) 


3.83(4) 


19.21[0.60](4) 


3.69(1) 


19.25 [0.60] (3) 


3.82(4) 


MIMO-ACFLIN 


19.34[0.59](1) 


3.21(3) 


19.19[0.59](3) 


4.00(5) 


19.25 [0.60] (4) 


3.82(5) 


DIRMO-SEL 


19.49[0.57](2) 


2.90(1) 


18.98[0.57](1) 


3.72(2) 


19.04[0.58](1) 


3.66(1) 


DIRMO-AVG 


20.39[0.56](5) 


4.74(5) 


19.36[0.58](5) 


4.61(6) 


19.43[0.58](6) 


4.77(6) 


DIRMO-WAVG 


19.71[0.58](3) 


3.21(2) 


19.02[0.57](2) 


3.72(3) 


19.11[0.58](2) 


3.79(3) 



(c) 





Deseasona 


ization : Yes - Input Selection : Yes 


Strategies 


WINNER 


COMB 


COMBW 




SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


DIR 


21.87[0.47](4) 


3.95(5) 


20.07[0.49](4) 


3.82(5) 


20.20[0.51](4) 


3.97(5) 


REC 


20.52[0.53](3) 


2.95(3) 


19.14[0.55](3) 


2.57(3) 


19.20[0.55](3) 


2.61(3) 


DIRREC 


23.06[0.84](5) 


3.77(4) 


21.03[0.68](5) 


3.77(4) 


20.95[0.66](5) 


3.74(4) 


MIMO-LOO 


20.02[0.53](2) 


2.59(2) 


18.86[0.54](2) 


2.45(2) 


18.88[0.54](1) 


2.34(1) 


MIMO-ACFLIN 


18.95[0.54](1) 


1.76(1) 


18.81[0.54](1) 


2.38(1) 


18.88[0.54](2) 


2.34(2) 



(d) 



Table 5: Pre- competition phase. Average forecasting errors (SMAPE*) with average ranking 
for each strategy in the 12 different configurations. The numbers in round bracket represent the 
ranking within the column. 
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Deseasonalization : No - Input Selection : No 


WINNER 


COMB 


WCOMB 


1. " DIRMO-WAVG" (3.07) 
"MIMO-LOO" (3.64) 
"DIRMO-SEL" (3.76) 
"DIRMO-AVG" (3.88) 
"R.EC" (3.98) 
"MIMO-ACFLIN" (4.48) 
"DIR" (5.19) 

2. "DIRREC" (8.00) 


1. "MIMO-LOO" (2.77) 
"MIMO-ACFLIN" (3.17) 
"DIRMO-AVG" (3.36) 
"DIRMO-WAVG" (3.71) 

2. "REC" (4.63) 
"DIR" (4.96) 
"DIRMO-SEL" (5.40) 

3. "DIRREC" (8.00) 


1. "MIMO-LOO" (2.71) 
"MIMO-ACFLIN" (2.71) 
"DIRMO-AVG" (3.42) 
"DIRMO-WAVG" (3.65) 

2. "REC" (4.73) 
"DIRMO-SEL" (5.32) 
"DIR" (5.47) 

3. "DIRREC" (7.99) 


(a) 


Deseasonalization : No - Input Selection : Yes 


WINNER 


COMB 


WCOMB 


1. "MIMO-LOO" (2.02) 
"REC" (2.20) 
"MIMO-ACFLIN" (2.58) 

2. "DIR" (3.20) 

3. "DIRREC" (5.00) 


1. "MIMO-LOO" (1.79) 
"MIMO-ACFLIN" (2.01) 

2. "REC" (2.96) 
"DIR" (3.27) 

3. "DIRREC" (4.96) 


1. "MIMO-LOO" (1.88) 
"MIMO-ACFLIN" (1.88) 

2. "REC" (3.01) 
"DIR" (3.27) 

3. "DIRREC" (4.96) 


(b) 


Deseasonalization : Yes - Input Selection : No 


WINNER 


COMB 


WCOMB 


1. "DIRMO-SEL" (2.90) 
"DIRMO-WAVG" (3.21) 
"MIMO-ACFLIN" (3.21) 
"MIMO-LOO" (3.83) 

2. "DIRMO-AVG" (4.74) 
" R.EC" (5.35) 

3. "DIRREC" (6.33) 
"DIR" (6.42) 


1. "MIMO-LOO" (3.69) 
"DIRMO-SEL" (3.72) 
"DIRMO-WAVG" (3.72) 
"REC" (3.87) 
"MIMO-ACFLIN" (4.00) 
"DIRMO-AVG" (4.61) 

2. "DIRREC" (6.15) 
"DIR" (6.23) 


1. "DIRMO-SEL" (3.66) 
"REC" (3.68) 
"DIRMO-WAVG" (3.79) 
"MIMO-LOO" (3.82) 
"MIMO-ACFLIN" (3.82) 

2. "DIRMO-AVG" (4.77) 

3. "DIRREC" (6.04) 
"DIR" (6.43) 


(c) 


Deseasonalization : Yes - Input Selection : Yes 


WINNER 


COMB 


WCOMB 


1. "MIMO-ACFLIN" (1.76) 

2. "MIMO-LOO" (2.59) 
"REC" (2.95) 

3. "DIRREC" (3.77) 
"DIR" (3.95) 


1. "MIMO-ACFLIN" (2.38) 
"MIMO-LOO" (2.45) 
"REC" (2.57) 

2. "DIRREC" (3.77) 
"DIR" (3.82) 


1. "MIMO-LOO" (2.34) 
"MIMO-ACFLIN" (2.34) 
"REC" (2.61) 

2. "DIRREC" (3.74) 
"DIR" (3.97) 



(d) 



Table 6: Pre- competition phase. Group of strategies statistically significantly different (sorted 
by increasing ranking) provided by Friedman and post-hoc tests for the 12 configurations. 
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The availability of the SMAPE* results, obtained according to the procedure used in ( Andrawis 



et al. 2011), makes possible the comparison of our pre-competition results with those of several 



others learning methods available in (Andrawis et al. 2011). For the sake of comparison, Table 



[7] reports the forecasting errors for some of the techniques considered in (Andrawis et al. 2011), 
notably Gaussian Process Regression (GPR), Neural Network (NN), Multiple Regression (MULT- 
REGR), Simple Moving Average (MOV-AVG), Holt's Exponential Smoothing and a combination 



(Combined) of such techniques. The comparison shows that the best configuration of Table 5d , that 



Model 


SMAPE* 


GPR-ITER 


19.90 


GPR-DIR 


21.22 


GPR-LEV 


20.19 


NN-ITER 


21.11 


NN-LEV 


19.83 


MULT-REGR1 


19.11 


MULTI-REGR2 


18.96 


MULT-REGR3 


18.94 


MOV-AVG 


19.55 


Holt's Exp Sm 


23.77 


Combined 


18.95 



Table 7: Forecasting errors for the different forecasting models. 



is the MIMO-ACFLIN strategy, is competitive with all these models with a SMAPE* amounting 
to 18.81%. 
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5.2. Competition results 

The SMAPE and ranking results for the competition phase are presented in Table [8] while the 
results of the post-hoc test are summarized in Table [9j 
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Deseasona 


lization : JNo - Input Selection : JNo 


Strategies 


WINNER 


COMB 


COMBW 




SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


DIR 


24.48[0.52](7) 


5.58(7) 


23.06[0.50](6) 


5.21(6) 


22.89[0.48](6) 


5.20(7) 


REC 


24.22[0.62](6) 


4.96(6) 


23.71 [0.52] (7) 


5.24(7) 


23.54[0.54](7) 


5.05(6) 


DIRREC 


44.97[0.85](8) 


7.93(8) 


40.37[0.80](8) 


7.97(8) 


38.17[0.72](8) 


7.95(8) 


MIMO-LOO 


21.92[0.56](1) 


2.90(1) 


21.55[0.50](1) 


2.84(1) 


21.64[0.49](1) 


2.79(1) 


MIMO-ACFLIN 


22.45[0.54](3) 


3.65(3) 


21.57[0.50](2) 


3.08(2) 


21.64[0.49](2) 


2.79(2) 


DIRMO-SEL 


22.60[0.54](5) 


3.99(5) 


22.27[0.49](5) 


4.46(5) 


22.50[0.50](5) 


4.84(5) 


DIRMO-AVG 


22.55[0.53](4) 


3.78(4) 


21.73[0.49](4) 


3.68(4) 


21.93[0.49](4) 


3.84(4) 


DIRMO-WAVG 


22.36[0.54](2) 


3.21(2) 


21.70[0.49](3) 


3.52(3) 


21.87[0.49](3) 


3.55(3) 



(a) 





Deseasonalization : JNo - Input Selection : Yes 


Strategies 


WINNER 


COMB 


COMBW 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


DIR 
REC 
DIRREC 


24.43[0.52](4) 
22.73[0.69](3) 
44.91[0.86](5) 


3.28(4) 
2.28(2) 
4.94(5) 


23.14[0.48](4) 
22.45[0.67](3) 
39.88[0.77](5) 


3.22(4) 
2.53(3) 
4.93(5) 


23.25[0.48](4) 
22.57[0.65](3) 
38.04[0.71](5) 


3.24(4) 
2.42(3) 
4.95(5) 


MIMO-LOO 
MIMO-ACFLIN 


22.18[0.53](1) 

22.62[0.51](2) 


2.12(1) 

2.39(3) 


21.78[0.49](1) 

21.83[0.50](2) 


2.09(1) 

2.24(2) 


21.84[0.49](1) 

21.84[0.49](2) 


2.19(1) 

2.19(2) 



(b) 





Deseasona 


ization : Yes - Input Selection : JNo 


Strategies 


WINNER 


COMB 


COMBW 




SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


DIR 


23.98[0.55](8) 


6.36(8) 


21.65[0.46](8) 


6.05(8) 


21.75[0.47](7) 


6.14(8) 


REC 


23.12[0.59](6) 


5.47(6) 


21.39[0.49](6) 


5.04(6) 


21.86[0.49](8) 


5.40(6) 


DIRREC 


23.51[0.60](7) 


6.15(7) 


21.58[0.52](7) 


5.78(7) 


21.57[0.51](6) 


5.67(7) 


MIMO-LOO 


21.11[0.54](4) 


3.69(4) 


20.27[0.47](4) 


3.77(3) 


20.34[0.47](3) 


3.69(3) 


MIMO-ACFLIN 


20.25[0.47](1) 


3.05(1) 


20.18[0.46](2) 


3.67(2) 


20.34[0.47](4) 


3.69(4) 


DIRMO-SEL 


20.85[0.51](2) 


3.45(3) 


20.18[0.46](3) 


3.77(4) 


20.23[0.45](1) 


3.45(1) 


DIRMO-AVG 


21.66[0.51](5) 


4.47(5) 


20.38[0.46](5) 


4.34(5) 


20.48[0.46](5) 


4.38(5) 


DIRMO-WAVG 


20.97[0.51](3) 


3.35(2) 


20.15[0.46](1) 


3.59(1) 


20.23[0.46](2) 


3.59(2) 



(c) 





Deseasonalization : Yes - Input Selection : Yes 


Strategies 


WINNER 


COMB 


COMBW 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


SMAPE* [Std err] 


Avg Rank 


DIR 
REC 
DIRREC 


23.91[0.50](5) 
22.57[0.64](3) 
23.66[0.58](4) 


4.14(5) 
3.05(3) 
3.79(4) 


21.54[0.47](5) 
21.39[0.63](3) 
21.48[0.52](4) 


3.94(5) 
2.96(3) 
3.50(4) 


21.58[0.48](5) 
21.45[0.64](3) 
21.47[0.51](4) 


3.91(5) 
2.93(3) 
3.50(4) 


MIMO-LOO 
MIMO-ACFLIN 


20.74[0.51](2) 
20.39[0.54](1) 


2.14(2) 
1.87(1) 


20.28[0.53](1) 

20.28[0.53](2) 


2.27(1) 

2.32(2) 


20.28[0.52](1) 

20.28[0.52](2) 


2.33(1) 

2.33(2) 



(d) 



Table 8: Competition phase. Average forecasting errors (SMAPE*) with average ranking for 
each strategy in the 12 different configurations. The numbers in round bracket represent the 
ranking within the column. 
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Deseasonalization : No - Input Selection : No 


WINNER 


COMB 


WCOMB 


1. "MIMO-LOO" (2.90) 
"DIRMO-WAVG" (3.21) 
"MIMO-ACFLIN" (3.65) 
"DIRMO-AVG" (3.78) 
"DIRMO-SEL" (3.99) 

2. "REC" (4.96) 
"DIR" (5.58) 

3. "DIRREC" (7.93) 


1. "MIMO-LOO" (2.84) 
"MIMO-ACFLIN" (3.08) 
"DIRMO-WAVG" (3.52) 
"DIRMO-AVG" (3.68) 
"DIRMO-SEL" (4.46) 
"DIR" (5.21) 

"REC" (5.24) 

2. "DIRREC" (7.97) 


1. "MIMO-LOO" (2.79) 
"MIMO-ACFLIN" (2.79) 
"DIRMO-WAVG" (3.55) 
"DIRMO-AVG" (3.84) 

2. "DIRMO-SEL" (4.84) 
"REC" (5.05) 
"DIR" (5.20) 

3. "DIRREC" (7.95) 


(a) 


Deseasonalization : No - Input Selection : Yes 


WINNER 


COMB 


WCOMB 


1. "MIMO-LOO" (2.12) 
"REC" (2.28) 
"MIMO-ACFLIN" (2.39) 

2. "DIR" (3.28) 

3. "DIRREC" (4.94) 


1. "MIMO-LOO" (2.09) 
"MIMO-ACFLIN" (2.24) 
" REC" (2.53) 

2. "DIR" (3.22) 

3. "DIRREC" (4.93) 


1. "MIMO-LOO" (2.19) 
"MIMO-ACFLIN" (2.19) 
"REC" (2.42) 

2. "DIR" (3.24) 

3. "DIRREC" (4.95) 


(b) 


Deseasonalization : Yes - Input Selection : No 


WINNER 


COMB 


WCOMB 


1. "MIMO-ACFLIN" (3.05) 
"DIRMO-WAVG" (3.35) 
"DIRMO-SEL" (3.45) 
"MIMO-LOO" (3.69) 
"DIRMO-AVG" (4.47) 

2. "REC" (5.47) 
"DIRREC" (6.15) 
"DIR" (6.36) 


1. "DIRMO-WAVG" (3.59) 
"MIMO-ACFLIN" (3.67) 
"MIMO-LOO" (3.77) 
"DIRMO-SEL" (3.77) 
"DIRMO-AVG" (4.34) 
" REC" (5.04) 
"DIRREC" (5.78) 
"DIR" (6.05) 


1. "DIRMO-SEL" (3.45) 
"DIRMO-WAVG" (3.59) 
"MIMO-LOO" (3.69) 
"MIMO-ACFLIN" (3.69) 
"DIRMO-AVG" (4.38) 

2. "REC" (5.40) 
"DIRREC" (5.67) 
"DIR" (6.14) 


(c) 


Deseasonalization : Yes - Input Selection : Yes 


WINNER 


COMB 


WCOMB 


1. "MIMO-ACFLIN" (1.87) 
"MIMO-LOO" (2.14) 

2. "REC" (3.05) 

3. "DIRREC" (3.79) 
"DIR" (4.14) 


1. "MIMO-LOO" (2.27) 
"MIMO-ACFLIN" (2.32) 

2. "REC" (2.96) 

3. "DIRREC" (3.50) 
"DIR" (3.94) 


1. "MIMO-LOO" (2.33) 
"MIMO-ACFLIN" (2.33) 

2. "REC" (2.93) 

3. "DIRREC" (3.50) 
"DIR" (3.91) 



(d) 

Table 9: Competition phase. Group of strategies statistically significantly different (sorted by 
increasing ranking) provided by Friedman and post-hoc tests for the 12 configurations. 
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The pre-competition results presented in the previous section suggest us to use the MIMO- 
ACFLIN strategy with the Comb model selection approach by removing the seasonality and ap- 
plying the input selection procedure, since this configuration obtains the smallest forecasting er- 
ror (18.81%). 

By using the MIMO-ACFLIN strategy and the corresponding configuration in the competition 
phase, we would generate forecasts with a SMAPE* equals to 20.28% which is quite good compared 
to the best computational intelligence entries of the competition as shown in Table 10 Figure [7] 
shows the forecasts of the MIMO-ACFLIN strategy versus the actual values for four NN5 time 
series to illustrate the forecasting capability of this strategy. 



Model 


SMAPE* 


MIMO-ACFLIN 


20.28 


Andrawis 


20.4 


Vogel 


20.5 


D'yakonov 


20.6 


Rauch 


21.7 


Luna 


21.8 


Wichard 


22.1 


Gao 


22.3 


Puma-Villanueva 


23.7 


Dang 


23.77 


Pasero 


25.3 



Table 10: Forecasting errors for different computational intelligence forecasting models which par- 
ticipate to the NN5 forecasting competition. 
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Figure 7: 



The forecast versus the actual of the MIMO-ACFLIN strategy for four NN5 time series. 
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5.3. Discussion 

From all presented results one can deduce the following observations below. These findings 
refer mainly to the pre-competition results. But, one can easily see that they mostly also apply to 
the competition phase results. 

• The overall best method is MIMO-ACFLIN, used with input selection, deseasonalization and 
equal weight combination (COMB). 

• The Multiple-Output strategies (MIMO and DIRMO) are invariably the best strategies. 
They beat the Single-Output strategies, such as DIR, REC, and DIRREC. Both MIMO 
and DIRMO give comparable performancce. For DIRMO, the selection of the parameter s 
is critical, since it has a great impact on the performance. Should there be an improved 
selection approach, this strategy would have a big potential. 

• Both versions of MIMO are comparable. Also the versions of DIRMO give close results, with 
perhaps DIRMO- WAVG a little better than the other two versions. 

• Among the Single-Output strategies, the REC strategy has almost always a smaller SMAPE 
and a better ranking than the DIR strategy. DIRREC is the worse strategy overall, and gives 
especially low accuracy when no deseasonalization is performed. 

• Deseasonalization leads to consistently better results (in 38 out of 39 models) . This result is 



consistent with some other studies, such as (Zhang and Qi 2005). The possible reason for 



this is that when no deseasonalization is performed, we are putting a higher burden on the 
model to forecast the future seasonal pattern plus the trend and the other aspects, which 
apparently is hard to simultaneously satisfy. 

Input selection is especially beneficial when we perform a deseasonalization. Absent desea- 
sonalization, the results are mixed (as to whether input selection improves the results or not). 
The possible explanation is that when no deseasonalization is performed, the model needs 
all the previous cycle to construct the future seasonal pattern. Performing an input selection 
will deprive it from essential information. 

Concerning the model selection aspect, both combination approaches (COMB and WCOMB) 
are superior to the winner-take-all (WINNER). Both COMB and WCOMB are comparable, 
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and the results do not differ by much. This is consistent with much of the findings in 



forecast combination literature, e.g. (Andrawis et al. 2011 Clemen 1989 Timmermann 



2006 Andrawis et al. 2010) 



• The relative performance and ranking of the different strategies is persistent. Most findings 
that are based on the pre-competition results are true for the competition phase results. 
This is also true for the findings concerning the deseasonalization, input selection, and model 
selection. This persistence is reassuring, as we can have some confidence in relying on the 
test or validation results for selecting the best strategies. 

• The best strategy based on the pre-competition data, the MIMO-ACFLIN method, would 
have topped all computational intelligence entries of the NN5 competition in the true com- 
petition hold-out data. 

6. Conclusion 

Forecasting a time series many steps into the future is a very hard problem because the larger the 
forecast horizon, the higher is the uncertainty. In this paper we presented a comparative review 
of existing strategies for multi-step ahead forecasting, together with an extensive comparison, 
applied on the 111 time series of the NN5 forecasting competition. The comparison gave some 
interesting lessons that could help researchers channel their experiments into the most promising 
approaches. The most consistent findings are that Multiple- Output approaches are invariably 
better than Single-Output approaches. Also, deseasonalization had a very considerable positive 
impact on the performance. Finally, the results are clearly quite persistent. So, selecting the 
best strategy based on testing performance is a very potent approach. A possible direction for 
future research could therefore be developing other new improved Multiple- Output strategies. Also, 
possibly tailoring deseasonalization methods specifically for Multiple-Output strategies could also 
be a promising research point. 
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